Syntax and semantics in a distributed speech understanding system
Abstract
Frederick Hayes-Roth and David J. Mostow
Computer Science Department*
Carnegie-Mellon University
Pittsburgh, Pa. 15213

The Hearsay II speech understanding system being developed at Carnegie-Mellon University has an independent knowledge source module for each type of speech knowledge. Modules communicate by reading, writing, and modifying hypotheses about various constituents of the spoken utterance in a global data structure. The syntax and semantics module uses rules (productions) of four types: (1) recognition rules for generating a phrase hypothesis when its needed constituents have already been hypothesized; (2) prediction rules for inferring the likely presence of a word or phrase from previously recognized portions of the utterance; (3) respelling rules for hypothesizing the constituents of a predicted phrase; and (4) postdiction rules for supporting an existing hypothesis on the basis of additional confirming evidence. The rules are automatically generated from a declarative (i.e., non-procedural) description of the grammar and semantics, and are embedded in a parallel recognition network for efficient retrieval of applicable rules. The current grammar uses a 450-word vocabulary and accepts simple English queries for an information retrieval system.

INTRODUCTION: THE PROBLEM

The fundamental problem facing the syntax and semantics component of a speech understanding system is uncertainty. The system is uncertain about a variety of questions, including: whether a given word is really uttered by the speaker; when a recognized word begins and ends; whether a particular interval of the utterance contains a silence, a filled pause ("er," "um," "uh"), an informationless interjection ("y'know," "I mean"), or an information-bearing word or phrase; whether a recognized word or phrase is used in a particular sense; etc.
Any decisions made on the basis of such uncertain information are potentially incorrect and must therefore be reversible. The classical method of reversing decisions is backtracking. Backtracking and best-first evaluation of alternative parses are the primary strategies employed by the Hearsay I speech understanding system (Reddy et al., 1973a, 1973b).

In Hearsay II (Lesser et al., 1975) multiple alternatives are represented explicitly in a global data structure ("blackboard") and considered in parallel rather than one at a time as in Hearsay I. Processing is driven by independent data-directed knowledge source modules (KSs) which create, examine, and revise hypotheses, stored on the blackboard, about the utterance. One dimension of the blackboard is level of representation: an interval of speech may be simultaneously represented at the acoustic, phonetic, phonemic, syllabic, word, phrasal, and conceptual levels. The KSs translate from one level to another with the ultimate objective of representing the utterance at the conceptual level, i.e., understanding it.

Hearsay II is a distributed logic system in that control of processing is distributed heterarchically among the KSs rather than organized hierarchically. Each KS is responsible for deciding when it has useful information to contribute to the analysis of the input.

* [...] Agency under contract no. F44620-73-C-0074 [...] by the Air Force Office of Scientific Research.

The syntax and semantics KS in Hearsay II is called SASS, and deals with hypotheses representing words and phrases perceived or expected in the utterance. From SASS's viewpoint, the blackboard can be viewed as a chart of hypothesized words as in Figure 1, which represents the word hypotheses generated by lower-level KSs in response to the utterance "Tell me about beef."
In the figure, time goes from left to right and the vertical dimension represents hypothesis credibility on a scale from -100 to 100, as estimated by other KSs.

SASS's problem is to find the most plausible sequence of temporally adjacent words. Plausibility is defined by the credibility of the individual word hypotheses and the grammaticality and meaningfulness of the sequence. The concept of temporal adjacency is generalized to tolerate fuzzy word boundaries, overlap between successive words, silences in the middle of word sequences, and unintelligible intervals.

Since some of the uttered words may not have been hypothesized, SASS must be able to expand the solution space by inferring the likely presence of a missing word on the basis of existing word hypotheses. Such inferences are relatively weak since several predictions may be plausible in a given context. In the example of Figure 1, SASS hypothesizes the missing word "tell" in the interval preceding "me about beef." Since SASS is uncertain as to which word hypotheses are correct, it also makes several incorrect word predictions. Figure 2 shows the words predicted by SASS on the basis of the words shown in Figure 1. The figures do not reflect the fact that the various hypotheses are generated at different times and that SASS starts generating predictions prior to completion of the word recognition process.

In order to control the potentially explosive search through this combinatorial and expanding solution space, SASS must be able to reflect the variable reliability of its inference rules and to relax its plausibility criteria dynamically so as to stimulate processing on unrecognized portions of the utterance. SASS must be able to use partial information to guide further processing in useful directions. To avoid duplicated computation, SASS must store and use partial parses, which are intermediate computations (plausible subsequences) common to many potential parses.
SASS must combine these partial parses into plausible complete parses, select the best complete parse, interpret the meaning of the recognized utterance, and respond appropriately.

The problems faced by SASS (uncertainty, combinatorial search, fuzzy pattern-matching, strong and weak inferences, and the need to exploit partial information) are common to many large knowledge-based systems. Efficient solution of these problems appears to require a system organization in which the scheduling of inferential processes is sensitive to various cooperative and competitive relationships among the inferred hypotheses. For example, processing should be facilitated on an hypothesis supported cooperatively by multiple sources of information. Conversely, processing should be inhibited on an hypothesis which competes (i.e., is inconsistent) with a strongly credible hypothesis. Inhibition in an environment of uncertainty must be implemented non-deterministically, since the weaker hypothesis may in fact be correct. Non-deterministic inhibition is effected in Hearsay II by a focus of attention mechanism which allocates computational resources so as to consider the most promising hypotheses before others (Hayes-Roth & Lesser, 1976).

The approach used in SASS is relevant to pattern recognition for its fuzzy pattern-matching; to problem solving for its flexible combination of bottom-up, top-down, forward inferencing, and problem reduction mechanisms; and to information retrieval and the problem of pattern-directed function invocation for its efficient mechanism for continuously monitoring a data base for occurrences of any of a large number of relational patterns or templates.

OVERVIEW OF METHOD

Given a declarative (i.e., non-procedural) description of the target language which our system is to understand, we need to convert it into behavior which is adequate to understand utterances in the language efficiently and robustly.
Our approach has been to automate this conversion as much as possible. Syntactic and semantic knowledge about the target language is expressed in a compact, readable grammar. A compiler converts the grammar into precondition-response productions. The productions are embedded in a recognition network to enable efficient continuous monitoring of the blackboard for stimuli matching production preconditions. In general, many productions will be invocable at any given time. Various scheduling policies serve to hasten the invocation of productions which are considered likely to generate useful (correct, relevant, and necessary) results and to inhibit or defer less promising invocations.

LINGUISTIC KNOWLEDGE

The grammar describing the target language is expressed using parameterized structural representations (PSRs), which are sets of attribute-object pairs. We use a PSR to define a class of words and phrases which can fulfill the same syntactic or semantic function in the target language. The current target language consists of simple English queries for a news retrieval program. For example, the PSR

    ($CLASS: $QUERY,
     $PNAME: "PARSED QUERY",
     ∈: $GIMME+$WHAT,
     ∈: TELL+$ME+$RE+$TOPICS,
     ∈: WHAT+HAPPENED+$ANYWAY,
     ∈: WHAT+$BE+THE+$NEWS+$RE+$TOPICS,
     ∈: $BE+THERE+$ANY+$PIECES+$RE+$TOPICS,
     $ACTION: PASS,
     $LEVEL: 300)

defines the class "$QUERY" of possible queries in terms of its alternative syntactic realizations. The attribute "∈" denotes membership in the class. Each member of the class is a sequence template whose constituents, separated by "+", are words or phrases. Phrasal constituents are prefixed by "$" and defined in turn by other PSRs. Additional attributes of the class are defined by other components of the PSR. "$ACTION: PASS" means that SASS's response upon recognizing an instance of any of the five templates in the class should be to treat it as an instance of $QUERY.
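As a rough illustration of the PSR just described, the following Python sketch stores a class definition and splits its sequence templates into constituents. It is not the authors' representation: the `PSR` container, the `is_phrase` helper, and the use of `$` as the phrase prefix are assumptions made for the example, and only three of the five query templates are listed.

```python
from dataclasses import dataclass

PHRASE_PREFIX = "$"  # assumption: one-character marker for phrasal constituents


@dataclass
class PSR:
    """Hypothetical container for one class definition from the grammar."""
    name: str             # class name, e.g. "$QUERY"
    members: list         # alternative sequence templates, as strings
    action: str = "PASS"  # response on recognizing a member
    level: int = 0        # relative completeness of the partial parse

    def templates(self):
        """Split each member template on '+' into its constituents."""
        return [member.split("+") for member in self.members]


def is_phrase(constituent):
    """True if the constituent is defined by another PSR rather than being a word."""
    return constituent.startswith(PHRASE_PREFIX)


query = PSR(
    name="$QUERY",
    members=[
        "$GIMME+$WHAT",
        "TELL+$ME+$RE+$TOPICS",
        "WHAT+HAPPENED+$ANYWAY",
    ],
    action="PASS",
    level=300,
)

tell_template = query.templates()[1]
phrases = [c for c in tell_template if is_phrase(c)]
words = [c for c in tell_template if not is_phrase(c)]
```

Here `words` holds the literal word TELL and `phrases` holds the three constituents that would be looked up in other PSRs.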
The $LEVEL attribute estimates the relative completeness of the partial parse underlying the hypothesized phrase. The PSR

    ($CLASS: $TOPICS,
     ∈: $PLACE,
     ∈: $FOOD,
     ∈: $TECHNOLOGY,
     ∈: $SCIENCE,
     ∈: $GOVERNMENT,
     ∈: $POLITICS,
     ∈: $PEOPLE,
     ∈: $TOPICS+$CONJUNCTION+$TOPICS,
     $ACTION: PASS,
     $LEVEL: 40)

defines the class of possible topics in the news in terms of its semantic subclasses. The grammar for the current 450-word target language consists of 113 PSRs.

TYPES OF BEHAVIOR RULES

SASS has a repertoire of strong and weak methods, represented by different types of behavior rules used in understanding.

A recognition rule generates a phrase hypothesis in response to sufficiently credible hypotheses for the phrase's constituents. SASS considers an hypothesized constituent to be recognizable if its credibility rating, determined by other KSs, exceeds a minimum threshold for plausibility. The hypothesized constituents may also have to satisfy some structural condition such as temporal adjacency between sequential constituents of a phrase. A recognition rule represents a strong inference; its strength is the probability that the recognized constituents can be interpreted as an instance of the phrase. For example, "beef" can be interpreted as a food or as a complaint, depending on context. Recognition rules drive processing upward toward a complete parse of the utterance from plausible partial parses. Recognition behavior can be thought of as bottom-up parsing.

A prediction rule hypothesizes a word or phrase which is likely to occur in the context of a previously recognized portion of the utterance. Prediction rules drive processing outward in time from "islands of plausibility," and are necessary since not all words in a spoken utterance may be recognized bottom-up by lower-level KSs. Predictive behavior can be thought of as forward inferencing.
The strength of a predictive inference is the conditional probability that the predicted constituent occurs, given that its predictive context has been recognized. This strength is inversely related to the number of constituents which can plausibly occur in the given context.

A respelling rule enumeratively hypothesizes the constituents of a predicted phrase, by subdividing an hypothesized sequence into hypotheses for its sequential constituents, or by splitting an hypothesized class into alternate hypotheses for its various members. Respelling rules drive processing downward toward the word level, so that high-level phrasal predictions can ultimately be tested word-by-word by lower-level KSs. Respelling can be thought of as top-down behavior or generation of subgoals from goals.

Finally, a postdiction rule solicits post hoc support for (i.e., serves to increase the credibility ratings of) existing hypotheses from other hypotheses in whose context they are plausible. Postdiction rules include prediction and respelling rules which are too weak to justify creation of hypotheses, but can contribute useful information when the hypotheses already exist. For example, an expectation for an instance of $TOPICS following the word "about" should not be respelled into hypotheses for all the nouns in the vocabulary, since to do so would explode the search space. However, once the word "beef" is hypothesized in the correct time interval on the basis of other knowledge, the hypothesis should receive support from the expectation for a topic word.
Postdiction rules serve three functions: they allow cooperation between inferences which support the same hypothesis on the basis of different evidence; they allow words and phrases hypothesized with initial low credibility ratings to be recognized on the basis of their contextual plausibility; and they help focus attention in productive directions by increasing the ratings of hypotheses which are contextually plausible (and thus relatively likely to be correct) so that processing on them is scheduled sooner. In the sense that postdiction responds to weakly-rated hypotheses by seeking causal antecedents (predictors) for them, postdiction can be thought of as post hoc inferencing or "twenty-twenty hindsight."

CONVERSION OF STATIC KNOWLEDGE TO BEHAVIOR RULES

Most of the information necessary for understanding the target language is implicit in the grammar which describes it. The automatic conversion of this static information into a usable procedural form is effected by a simple compiler called CVSNET, which translates the PSRs into recognition, prediction, respelling, and postdiction rules. A few rules hand-coded in explicitly procedural form are then added, for example a rule that prints a message when a sentence is recognized.

The only linguistic knowledge in CVSNET itself is an elementary understanding of sequences and classes. CVSNET decomposes the sequence templates \(c_1 + c_2 + \cdots + c_n\) into pairs of subsequence templates. For example, from the sequence template TELL+$ME+$RE+$TOPICS, CVSNET generates the new templates $ME+$RE+$TOPICS and $RE+$TOPICS. CVSNET then generates the appropriate rules for each template. The recognition rule for a sequence is to concatenate its hypothesized subsequences provided they are temporally adjacent and sufficiently credible. The respelling rule respells a predicted sequence into its two subsequences.
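The pairwise decomposition of a sequence template can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `decompose` is hypothetical, `$` is used as the phrase prefix, and a strictly right-branching split (first constituent versus the remainder) is assumed because it reproduces the two new templates named in the example; the actual compiler may form other subsequences as well.

```python
def decompose(template):
    """Split a '+'-separated template into (head, rest) pairs, one per suffix.

    Assumption: right-branching binary decomposition; the real CVSNET
    compiler is not specified at this level of detail.
    """
    parts = template.split("+")
    pairs = []
    for i in range(len(parts) - 1):
        head = parts[i]
        rest = "+".join(parts[i + 1:])
        pairs.append((head, rest))
    return pairs


pairs = decompose("TELL+$ME+$RE+$TOPICS")

# Multi-constituent remainders are the newly created subsequence templates.
new_templates = [rest for _, rest in pairs if "+" in rest]
```

Each pair is a candidate for one recognition rule (concatenate head and rest) and one respelling rule (split the sequence back into head and rest).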
Prediction rules are generated to predict the remaining constituents of the sequence when a subsequence of it has been recognized. Similarly, CVSNET generates rules for recognizing an instance of a class from an hypothesized constituent of the class and for respelling a predicted class into its constituents. CVSNET estimates the strength of each such rule as an inverse function of class size. CVSNET also generates the relevant postdiction rules.

Some of the rules generated from the PSRs are shown below; rule type is indicated by the type of arrow separating stimulus and response ("->" for recognition, "=>" for prediction, "+>" for respelling, and "<=" for postdiction) and rule strength is shown in parentheses.

    TELL & $ME        ->  TELL+$ME            <CONCATENATE (100) (100)>
    TELL & $ME        <=  TELL+$ME            <POSTDICT!SEQ (100) (100)>
    TELL+$ME          +>  TELL & $ME          <RESPELL!SEQ (100) (100)>
    $ME               =>  TELL                <PREDICT!LEFT (50)>
    TELL              <=  $ME                 <POSTDICT!LEFT (50)>
    TELL              =>  $ME+$RE+$TOPICS     <PREDICT!RIGHT (100)>
    $ME+$RE+$TOPICS   <=  TELL                <POSTDICT!RIGHT (100)>
    $FOOD             ->  $TOPICS             <PASS (100)>
    $TOPICS           +>  $FOOD               <RESPELL!CLASS (70)>
    $FOOD             <=  $TOPICS             <POSTDICT!ELEMENT (88)>

The linguistic knowledge expressed compactly in the grammar is represented highly redundantly in the generated rules. This redundancy provides the basis for robust performance in the errorful domain of speech: in regions of the utterance where strong inferences (recognition rules) are inadequate (for example, because lower-level KSs have failed to hypothesize some of the uttered words), weaker inferences must be applied in order for the utterance to be understood.

IDENTIFICATION OF INVOCABLE RULES

All of the rules described have the form \([\,\mathrm{precondition}(x_1, x_2, \ldots, x_n) \stackrel{f}{\Rightarrow} \mathrm{response}(x_1, x_2, \ldots, x_n)\,]\), signifying that a specified response can be inferred with strength \(f\) from the objects \(x_1, x_2, \ldots, x_n\) whenever these objects are in the relationships described by the associated precondition.
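The precondition-response form can be made concrete with a small Python sketch of one recognition rule. Everything named here is an assumption for illustration: the `Hypothesis` and `Rule` classes, the threshold values `MIN_RATING` and `MAX_GAP`, the `$` phrase prefix, and in particular the choice to rate the new phrase by its weakest constituent, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Hypothesis:
    label: str   # word or phrase name
    start: int   # begin time, arbitrary units
    end: int     # end time, arbitrary units
    rating: int  # credibility on the -100..100 scale


@dataclass(frozen=True)
class Rule:
    kind: str      # "recognition", "prediction", "respelling" or "postdiction"
    strength: int  # strength f of the inference
    precondition: Callable[[Sequence[Hypothesis]], bool]
    response: Callable[[Sequence[Hypothesis]], Hypothesis]


MIN_RATING = 20  # hypothetical plausibility threshold
MAX_GAP = 5      # hypothetical tolerance for gaps or overlaps between neighbours


def adjacent_and_credible(hyps):
    """Precondition: every constituent is credible and neighbours nearly abut."""
    credible = all(h.rating >= MIN_RATING for h in hyps)
    adjacent = all(abs(b.start - a.end) <= MAX_GAP for a, b in zip(hyps, hyps[1:]))
    return credible and adjacent


def concatenate(hyps):
    """Response: a phrase hypothesis spanning all constituents.

    Assumption: the new rating is the weakest constituent rating.
    """
    label = "+".join(h.label for h in hyps)
    return Hypothesis(label, hyps[0].start, hyps[-1].end, min(h.rating for h in hyps))


tell_me_rule = Rule("recognition", 100, adjacent_and_credible, concatenate)


def invoke(rule, hyps) -> Optional[Hypothesis]:
    """Apply the response only when the precondition holds."""
    return rule.response(hyps) if rule.precondition(hyps) else None


tell = Hypothesis("TELL", 0, 10, 60)
me = Hypothesis("$ME", 12, 20, 70)
late_me = Hypothesis("$ME", 40, 50, 70)

recognized = invoke(tell_me_rule, [tell, me])     # constituents nearly abut
rejected = invoke(tell_me_rule, [tell, late_me])  # gap exceeds MAX_GAP
```

Keeping the thresholds outside the rule mirrors the idea that matching criteria can be changed without regenerating the rules themselves.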
The large number of rules required even in a relatively simple system (over 3000 rules for a 450-word vocabulary) necessitates an efficient means of continuously monitoring the blackboard to determine which rules are currently invocable because of data satisfying their preconditions. This problem is solved by embedding the rules in an automatically compilable recognition network (ACORN), as discussed elsewhere (Hayes-Roth & Mostow, 1975). In brief, each grammatical constituent (word or phrase) is assigned a unique node in the network. Rules whose preconditions refer to the constituent are stored at the node. Whenever an hypothesis for the constituent is created or revised, its node is activated and the relevant rules become invocable.

PRINCIPLES OF CONTROL

The rule preconditions are defined in terms of various thresholds for plausibility, temporal adjacency, etc. These thresholds can be given values specific to a particular region of the utterance and are dynamically modifiable. Thus rules are invoked not only in response to new hypotheses but also in response to local threshold changes. This mechanism allows flexible matching of rule preconditions. Thresholds can be relaxed in unrecognized regions of the utterance to permit localized application of methods whose weakness would cause combinatorial explosion if they were applied uniformly throughout the utterance.

Hypotheses are explicitly linked in the data base to hypotheses which support them inferentially, and the links are marked with the strengths of the inferences. A rating policy module (RPOL) rates the plausibility of new hypotheses on the basis of the ratings of the hypotheses which support them and the strengths with which they do so. RPOL updates these ratings when an hypothesis receives new support or when the rating of one of its supporting hypotheses is changed.
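The node-per-constituent lookup described for the recognition network earlier in this passage can be sketched as a simple index. This is a minimal illustration, not the ACORN implementation: the class name `RecognitionNetwork`, its method names, the use of rule strings as stand-ins for rules, and the `$` phrase prefix are all assumptions.

```python
from collections import defaultdict


class RecognitionNetwork:
    """Hypothetical index: one node per constituent, holding the rules that mention it."""

    def __init__(self):
        self._rules_at = defaultdict(list)

    def add_rule(self, rule_name, constituents):
        """Store the rule at the node of every constituent its precondition refers to."""
        for constituent in constituents:
            self._rules_at[constituent].append(rule_name)

    def activate(self, constituent):
        """Return the rules that become invocable when an hypothesis for the
        constituent is created or revised (empty if no rule mentions it)."""
        return list(self._rules_at.get(constituent, []))


net = RecognitionNetwork()
net.add_rule("TELL & $ME -> TELL+$ME", ["TELL", "$ME"])
net.add_rule("$ME => TELL", ["$ME"])
net.add_rule("$FOOD -> $TOPICS", ["$FOOD"])

on_me = net.activate("$ME")     # a new or revised hypothesis for $ME
on_beef = net.activate("BEEF")  # no rule in this toy network mentions BEEF
```

The point of the index is that a blackboard change touches only the rules stored at one node, rather than requiring a scan of all several thousand rules.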
Hypotheses are rated separately on their contextual plausibility and on the extent to which they are supported by lower-level hypotheses.

The combinatorial search can be controlled by modifying the appropriate threshold values. For example, the search can be broadened or narrowed by relaxing or tightening criteria for recognizability, since the solution space consists only of sequences of recognizable words. A best-first search policy can be implemented simply by ordering rule invocations according to the strengths of the rules and the plausibility ratings of the hypotheses matching the rules' preconditions.

The search can be further focused by inhibiting low-level processing within a region already accounted for by a credible high-level hypothesis. Of course this policy must be pursued with caution since the high-level hypothesis may be incorrect. Cautious inhibition is implemented as deferred processing. A similar policy of procrastination can be used to defer application of weak inferences in a region until strong methods fail. An inferential process can be deferred by scheduling it with low priority (so that it may never in fact be executed), or by scheduling it only when the relevant thresholds are relaxed. The latter mechanism permits reconsideration of previously rejected alternatives.

Discourse rules can also help to focus the search. For example, an hypothesis that the current topic of conversation is food increases the a priori probability that the word "beef" will be uttered. If we can predict subject matter or syntax from any one of many knowledge elements (e.g., a recognized cue word in the same utterance, semantic analysis of previous utterances, knowledge of the particular speaker's interests), we can create such an hypothesis.
This form of semantic and syntactic priming is non-restrictive in that it does not preclude recognizing an utterance which is inconsistent with an hypothesized topic of conversation or an expectation for a particular grammatical construction. The mechanism is also graceful in that it does not impose a strict hierarchy of topical domains, and in fact tolerates ambiguity and uncertainty in the expectations generated by previous discourse.

Inexact matching can also be carefully controlled with thresholds. An interval of silence in the middle of an utterance can be accepted by relaxing temporal adjacency thresholds in the region of the silence so that hypothesized sequence constituents temporally separated by the silence will be considered temporally adjacent. For example, if the speaker says "Tell me about . . . beef," this mechanism allows the words "about" and "beef" to be considered temporally adjacent. Interjections and unclear intervals of speech can be non-deterministically ignored by treating them as silences.

Sometimes the uttered words cannot be recognized by lower-level KSs even after SASS hypothesizes them on the basis of surrounding context. In such cases, partially-matched phrases can be recognized by lowering credibility thresholds in unintelligible intervals so that unfulfilled expectations for missing constituents are treated as if they had been fulfilled. These mechanisms can even be used to tolerate some variation from the target language by ignoring extra verbiage not accounted for in the grammar and by filling in omitted constituents required by the grammar.

PERFORMANCE EVALUATION

The contribution of each KS [...] recognition of [...]. On the other hand, the word-hypothesizer KS might eventually have lowered its own thresholds enough to have weakly hypothesized the missing "tell."
In this case, SASS's postdiction of the hypothesized "tell" from its surrounding context might have been critical in increasing its credibility rating sufficiently to permit it to be recognized.

Despite the complex dynamics of the integrated system, we do have an evaluation methodology for SASS which will be pursued in the next year. Basically, our strategy is to generate a variety of artificial problems, each defined by a set of hypothesized words, and measure the elapsed time until SASS parses the utterance. In particular, we should be able to evaluate the relative efficacy of the four types of behavior rules in overcoming various kinds of error in the artificial input. If we can then estimate the relative frequencies of different kinds of errors generated by lower-level KSs, we can attempt to optimize SASS's behavioral profile.

CONCLUSION

There are many functions to be performed by a syntax and semantics knowledge source within a speech understanding system. In addition to simply parsing a sentence, the knowledge source must use a variety of strong and weak inferencing methods to hypothesize missing constituents and adduce support for existing hypotheses found in appropriate contexts. A production system using four types of rules has been developed to implement such desirable "knowledgeable" behaviors, which are automatically inferred from a simple declarative representation of the language to be understood. By making the invocation of a rule dependent upon both the credibility of the data matching the rule's preconditions and the estimated strength of the rule as a useful inference, the entire search process may be controlled so as to pursue dynamically modifiable global and local processing objectives. In sum, such a production system provides a general framework for representing "knowledgeable" syntactic and semantic behaviors.
Moreover, the fine computational grain of the behavior rules makes possible the flexible and precise control needed to avoid a combinatorial explosion in the search for a plausible interpretation of continuous speech.
Publication date: 1976